Unsupervized Word Segmentation: the Case for Mandarin Chinese
Abstract
In this paper, we present an unsupervised segmentation system tested on Mandarin Chinese. Following Harris's hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on Jin and Tanaka-Ishii (2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation Bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervised systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).
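The abstract names the Variation of Branching Entropy without defining it. The sketch below illustrates one common formulation, which is an assumption here rather than the authors' exact definition: the right branching entropy of a string is the Shannon entropy of the characters that follow it in the corpus, and its variation is the difference between the entropy of the string and that of the string minus its last character. The function names `successor_counts`, `branching_entropy`, and `entropy_variation` are hypothetical, and the normalization and Viterbi decoding steps mentioned in the abstract are not shown.

```python
import math
from collections import Counter, defaultdict


def successor_counts(corpus, max_len):
    """Map every substring of length <= max_len to a Counter of the
    characters observed immediately after it.

    Assumption: occurrences at the very end of a sentence have no
    follower and are simply ignored.
    """
    followers = defaultdict(Counter)
    for sentence in corpus:
        for i in range(len(sentence)):
            for n in range(1, max_len + 1):
                j = i + n
                if j >= len(sentence):
                    break
                followers[sentence[i:j]][sentence[j]] += 1
    return followers


def branching_entropy(followers, x):
    """Right branching entropy of x in bits; 0.0 if x was never followed."""
    counts = followers.get(x)
    if not counts:
        return 0.0
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_variation(followers, x):
    """Entropy of x minus entropy of x without its last character.

    Assumption: a rise in entropy hints at a boundary after x,
    while a drop hints that x continues into a longer unit.
    """
    if len(x) < 2:
        raise ValueError("x must contain at least two characters")
    return branching_entropy(followers, x) - branching_entropy(followers, x[:-1])
```

On a real corpus, `successor_counts` would be called once on the unsegmented text, and `entropy_variation` would then be queried for each candidate string; a symmetric left-to-right and right-to-left treatment is a natural extension that this sketch leaves out.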
Similar resources
Segmentation non supervisée : le cas du mandarin
Unsupervized Word Segmentation. In this paper, we present an unsupervised segmentation system tested on Mandarin Chinese. Following Harris's hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on Jin and Tanaka-Ishii (2006) by adding normalization and Viterbi decoding. This enables us to remove most of the threshol...
Cross-linguistic generalization of the distal rate effect: Speech rate in context affects whether listeners hear a function word in Chinese Mandarin
Recent findings show that altering the speech rate of the context several syllables away from a word (i.e., the distal context) can cause the word to disappear in perception in non-tonal Indo-European languages like English [1] and Russian [2]. This study investigated the distal rate effect in Chinese Mandarin, a tonal language belonging to the Sino-Tibetan language family. We examined whether ...
Can MDL Improve Unsupervised Chinese Word Segmentation?
It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower...
Chinese text word-segmentation considering semantic links among sentences
Tokenization of Chinese input text into words is a necessary step in building a Mandarin Chinese text-to-speech system. Several word-segmentation algorithms have been developed in which linguistic information is combined with statistical information or with heuristic rules. In this paper we investigate the advantages that can arise when semantic relations among sentences are taken into account during the word se...
Semi-supervised Chinese Word Segmentation for CLP2012
Chinese word segmentation (CWS) lays the essential foundation for Mandarin Chinese analysis. However, its performance is always limited by the identification of unknown words, especially for short texts such as microblog posts. While local context is of little help in handling unknown words, global context does provide enough contextual information and can be used to guide the CWS process. Based on this moti...
Publication date: 2012